160 research outputs found

    On the design and implementation of broadcast and global combine operations using the postal model

    A number of models have been proposed in recent years for message-passing parallel systems. Examples are the postal model and its generalization, the LogP model. In the postal model, a parameter λ is used to model the communication latency of the message-passing system. During each round, each node can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of λ and will arrive at the receiving node at round r + λ - 1. Our goal in this paper is to bridge the gap between the theoretical modeling and the practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely, the broadcast operation and the global combine operation. Those practical issues include, for example, 1) techniques for measuring the value of λ on a given machine, 2) creating efficient broadcast algorithms that take the latency λ and the number of nodes n as parameters, and 3) creating efficient global combine algorithms for parallel machines whose λ is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning; for example, a properly tuned broadcast improves on the known implementation by more than 20%.
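
    The round structure described in this abstract can be illustrated with a small counting sketch. It assumes an integer λ and a greedy schedule in which every node that already holds the message sends it to a new node in every round; the function names are hypothetical and this is not the tuned broadcast algorithm from the paper.

```python
def informed_per_round(lam, rounds):
    """Return counts[t] = number of nodes holding the message at the end of round t.

    Assumptions (not from the paper): lam is a positive integer, the root holds
    the message at round 0, and every informed node sends to a fresh node in
    each round. A message sent in round r arrives in round r + lam - 1.
    """
    if lam < 1:
        raise ValueError("lam must be a positive integer")
    counts = [1]  # round 0: only the root holds the message
    for t in range(1, rounds + 1):
        # Arrivals in round t were sent in round t - lam + 1, by the nodes
        # that were already informed at the end of round t - lam.
        arrivals = counts[t - lam] if t >= lam else 0
        counts.append(counts[t - 1] + arrivals)
    return counts


def rounds_to_reach(n, lam):
    """Smallest number of rounds after which at least n nodes are informed."""
    counts = [1]
    t = 0
    while counts[-1] < n:
        t += 1
        arrivals = counts[t - lam] if t >= lam else 0
        counts.append(counts[-1] + arrivals)
    return t


if __name__ == "__main__":
    print(informed_per_round(1, 3))
    print(informed_per_round(2, 5))
    print(rounds_to_reach(8, 2))
```

    With λ = 1 the count doubles every round, and with λ = 2 it follows the Fibonacci sequence, so a larger latency slows the growth of the informed set under this schedule.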

    Non-volatile spin wave majority gate at the nanoscale

    A fork-like spin wave majority structure with a feature size of 40 nm is presented and investigated through micromagnetic simulations. The structure consists of three merging out-of-plane magnetization spin wave buses and four magneto-electric cells serving as three inputs and an output. The information of the logic signals is encoded in the phase of the transmitted spin waves and subsequently stored as the direction of magnetization of the magneto-electric cells upon detection. The minimum dimensions of the structure that produce an operational majority gate are identified. For all input combinations, the detection scheme employed manages to capture the majority phase result of the spin wave interference and ignore all reflection effects induced by the geometry of the structure.
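
    The phase encoding described in this abstract can be illustrated with an idealized interference sketch. It assumes three waves of equal amplitude with no attenuation or reflections, and it assumes that logic 0 maps to phase 0 and logic 1 to phase π; the function name is hypothetical and the code is in no way a micromagnetic simulation.

```python
import cmath
import math
from itertools import product


def phase_majority(bits):
    """Superpose three unit-amplitude waves and read the bit from the result's phase.

    Assumed encoding (not from the paper): bit 0 -> phase 0, bit 1 -> phase pi.
    The sum of three such phasors has real part +3, +1, -1 or -3, so its sign
    always follows the majority of the inputs.
    """
    total = sum(cmath.exp(1j * math.pi * b) for b in bits)
    return 1 if total.real < 0 else 0


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, phase_majority(bits))
```

    Because the number of inputs is odd, the superposed wave never cancels completely, which is why the phase of the result is always well defined in this idealized picture.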

    A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

    For massively parallel systems, the probability of a system failure due to a random hardware fault becomes statistically very significant because of the huge number of components. Besides, fault injection experiments show that multiple failures go undetected, leading to incorrect results. Hence, massively parallel systems require the ability to tolerate the faults that will occur. The FTMPS project presents a scalable implementation to integrate the different steps to fault tolerance into existing HPC systems. On the initial parallel system, only 40% of (randomly injected) faults do not cause the application to crash or produce wrong results. In the resulting FTMPS prototype, more than 80% of these faults are correctly detected and recovered. The resulting overhead for the application is only between 10 and 20%. Evaluation of the different, co-operating fault tolerance modules shows the flexibility and the scalability of the approach. This project is partly sponsored by ESPRIT project 6731 (FTMPS): "Fault Tolerance in Massively Parallel Systems". Geert Deconinck and Johan Vounckx have a grant from the Flemish Institute for the Advancement of Scientific and Technological Research in Industry (IWT). Rudy Lauwereins is a Senior Research Associate of the Belgian Fund for Scientific Research.

    Fast Prototyping and Refinement of Complex Dynamic Data Types in Multimedia Applications for Consumer Devices

    Portable consumer devices are steadily increasing their capabilities and can now implement new multimedia algorithms that were reserved for powerful workstations only a few years ago. Unfortunately, the original design characteristics of such algorithms often do not allow them to be ported directly to current embedded devices. These algorithms share complex and intensive dynamic memory use, and actual embedded systems cannot provide the efficient general-purpose memory management that is needed. As a result, dynamic memory optimizations are a requirement when porting these applications. Within these optimizations, the refinement of the dynamically (de)allocated abstract data type implementations in the complex multimedia applications involved is one of the most important and difficult parts of an efficient mapping of the algorithms onto low-power and high-speed embedded consumer devices. In this paper, we describe a high-level approach for modeling and refining complex data types using abstract derived classes in C++. This approach enables the multimedia developer to compose, evaluate and refine complex data types in a conceptually straightforward way, without a time-consuming programming effort.

    Methodology for Refinement and Optimization of Dynamic Memory Management for Embedded Systems in Multimedia Applications

    In multimedia applications, run-time memory management support has to allow real-time memory de/allocation, retrieval and processing of data. Thus, its implementation must be designed to combine high speed, low power, large data storage capacity and a high memory bandwidth. In this paper, we assess the performance of our new system-level exploration methodology to optimise the memory management of typical multimedia applications in an extensively used 3D reconstruction image system. This methodology is based on an analysis of the number of memory accesses, normalised memory footprint and energy estimations for the system studied. This results in an improvement of up to 44.2% in normalised memory footprint and up to 22.6% in estimated energy dissipation over conventional static memory implementations in an optimised version of the driver application. Finally, our final version scales the memory consumed in the system perfectly over a wide range of input parameters, whereas the statically optimised version is unable to do this.

    Inversion optimization in majority-inverter graphs

    Many emerging nanotechnologies realize majority gates as primitive building blocks and they benefit from a majority-based synthesis. Recently, Majority-Inverter Graphs (MIGs) have been introduced to abstract these new technologies. We present optimization techniques for MIGs that aim at rewriting the complemented edges of the graph without changing its shape. We demonstrate the performance of our optimization techniques by considering three cases of emerging technology design: semi-custom digital design using Spin Wave Devices (SWDs) and Quantum-Dot Cellular Automata (QCA), and logic in-memory operation within Resistive Random Access Memories (RRAMs). Our experimental results show that SWD and QCA technologies benefit from the minimization of complemented edges. Area, delay, and power of SWD-based circuits are improved by 13.8%, 21.1%, and 9.2% respectively, while the number of QCA cells in QCA-based circuits can be decreased by 4.9% on average. Reductions of 14.4% and 12.4% in the number of devices and sequential steps respectively can be achieved for RRAMs when the number of nodes with exactly one complemented input is increased during MIG optimization.
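
    Rewriting complemented edges without changing the shape of the graph is possible because the majority function is self-dual: complementing all three inputs complements the output. The sketch below checks that identity exhaustively for a single node. The node representation (a tuple of input-inversion flags plus an output-inversion flag) and the function names are hypothetical, and the code is not the optimization algorithm from the paper.

```python
from itertools import product


def maj(x, y, z):
    """Three-input majority on 0/1 values."""
    return (x & y) | (x & z) | (y & z)


def evaluate(inputs, inverted_inputs, inverted_output):
    """Evaluate one majority node whose edges may carry inverters."""
    vals = [v ^ int(flag) for v, flag in zip(inputs, inverted_inputs)]
    return maj(*vals) ^ int(inverted_output)


def flip_node(inverted_inputs, inverted_output):
    """Toggle the inverter on every input edge and on the output edge.

    By self-duality, M(~x, ~y, ~z) == ~M(x, y, z), so the node computes the
    same function afterwards while its count of complemented inputs changes
    from k to 3 - k.
    """
    return tuple(not flag for flag in inverted_inputs), not inverted_output


if __name__ == "__main__":
    before = ((True, True, False), False)  # two complemented inputs
    after = flip_node(*before)             # one complemented input, complemented output
    same = all(
        evaluate(bits, *before) == evaluate(bits, *after)
        for bits in product((0, 1), repeat=3)
    )
    print(after, same)
```

    Applied to a node with two complemented inputs, the flip yields a node with exactly one complemented input and a complemented output, which is the kind of node count the abstract reports as relevant for RRAMs.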